
fix: proxy-webhook selector matches operator pods#3228

Open
bowling233 wants to merge 2 commits into tektoncd:main from ZJUSCT:main

Conversation

@bowling233


Changes

Fixes #3227

Both the tekton-operator and tekton-operator-proxy-webhook Deployments
label their Pods with name: tekton-operator. The
tekton-operator-proxy-webhook Service uses this same label as its only
selector, so it inadvertently load-balances traffic across both Deployments.
Because tekton-operator pods do not serve on port 8443, ~50% of admission
webhook requests fail with connection refused. Since the
MutatingWebhookConfiguration has failurePolicy: Fail, each failure
immediately rejects TaskRun Pod creation.

Changes:

  • cmd/kubernetes/operator/kodata/webhook/webhook.yaml: rename the
    proxy-webhook Deployment's matchLabels selector and pod template label
    from name: tekton-operator to name: tekton-operator-proxy-webhook;
    update the Service selector to match.
  • cmd/openshift/operator/kodata/webhook/webhook.yaml: same change for the
    OpenShift manifest.

The existing app: tekton-operator label is preserved on both Deployments.
No other resources are affected.
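
To make the selector overlap concrete, here is a minimal Go sketch of how an equality-based selector picks endpoints before and after the rename. It is not code from this repository: the `Pod` type, the `ServesWebhook` flag, and the helper names are hypothetical, and it assumes one pod per Deployment with requests spread evenly across endpoints.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes Pod.
type Pod struct {
	Name          string
	Labels        map[string]string
	ServesWebhook bool // true if the pod listens on port 8443
}

// matches reports whether every key/value pair in selector is present in
// labels, which is how an equality-based Service selector behaves.
func matches(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectPods returns the pods whose labels satisfy the selector.
func selectPods(selector map[string]string, pods []Pod) []Pod {
	var selected []Pod
	for _, p := range pods {
		if matches(selector, p.Labels) {
			selected = append(selected, p)
		}
	}
	return selected
}

// failureFraction is the share of endpoints that do not serve the webhook,
// assuming requests are spread evenly across them.
func failureFraction(endpoints []Pod) float64 {
	if len(endpoints) == 0 {
		return 0
	}
	failing := 0
	for _, p := range endpoints {
		if !p.ServesWebhook {
			failing++
		}
	}
	return float64(failing) / float64(len(endpoints))
}

func main() {
	operator := Pod{
		Name:   "tekton-operator",
		Labels: map[string]string{"app": "tekton-operator", "name": "tekton-operator"},
	}
	webhookOld := Pod{
		Name:          "tekton-operator-proxy-webhook",
		Labels:        map[string]string{"app": "tekton-operator", "name": "tekton-operator"},
		ServesWebhook: true,
	}
	webhookNew := webhookOld
	webhookNew.Labels = map[string]string{"app": "tekton-operator", "name": "tekton-operator-proxy-webhook"}

	// Before the fix: the Service selector matches both pods.
	before := selectPods(map[string]string{"name": "tekton-operator"}, []Pod{operator, webhookOld})
	fmt.Printf("old selector: %d endpoints, failure fraction %.2f\n", len(before), failureFraction(before))

	// After the fix: only the proxy-webhook pod carries the new label.
	after := selectPods(map[string]string{"name": "tekton-operator-proxy-webhook"}, []Pod{operator, webhookNew})
	fmt.Printf("new selector: %d endpoints, failure fraction %.2f\n", len(after), failureFraction(after))
}
```

Under those assumptions the old selector yields two endpoints with half of them unable to answer on 8443, which lines up with the ~50% failure rate described above; the new selector yields a single serving endpoint.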

Alternative considered: adding a set-based (NotIn) expression to the
Service selector to exclude tekton-operator pods. This was not viable
because a Kubernetes Service selector is a flat map of label keys to
values and supports only equality-based matching.

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Note on tests: This bug only manifests at the Service routing layer
(i.e., ~50% of requests land on a pod with no server). There is no
in-tree unit or integration test that exercises which pods a Service
selects. A targeted e2e test verifying that the proxy-webhook Service
endpoints do not include tekton-operator pods would be a good addition,
but is left for a follow-up.

Release Notes

Fix: the tekton-operator-proxy-webhook Service selector incorrectly matched
tekton-operator pods in addition to proxy-webhook pods, causing ~50% of
admission webhook requests to fail with "connection refused" and TaskRun Pod
creation to be rejected. Users on v0.78.1 can work around this until upgrading
by adding `pod-template-hash: <webhook-pod-hash>` to the Service selector.

Copilot AI review requested due to automatic review settings February 19, 2026 10:19
@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 19, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Feb 19, 2026


CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: bowling233 / name: Baolin Zhu (49cacf1)

@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 19, 2026

Copilot AI left a comment


Pull request overview

Fixes a production routing bug where the tekton-operator-proxy-webhook Service selector unintentionally matched both proxy-webhook and main operator pods, causing intermittent admission webhook failures and rejected TaskRun Pod creation.

Changes:

  • Update the proxy-webhook Deployment selector + pod template label to use name: tekton-operator-proxy-webhook.
  • Update the proxy-webhook Service selector to match the new pod label (Kubernetes + OpenShift manifests).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cmd/kubernetes/operator/kodata/webhook/webhook.yaml | Aligns proxy-webhook Deployment/Service selectors to target only proxy-webhook pods on Kubernetes. |
| cmd/openshift/operator/kodata/webhook/webhook.yaml | Same selector/label fix for the OpenShift manifest. |


Comment thread cmd/kubernetes/operator/kodata/webhook/webhook.yaml
Comment thread cmd/openshift/operator/kodata/webhook/webhook.yaml
@jkhelil

jkhelil commented Feb 22, 2026

Member

@bowling233, thank you for your PR.

  • Can you check what happens to existing clusters during an upgrade (install 0.78.1, then apply your change)?
    Please describe the result and post proof that the upgrade works and is not broken.

@anithapriyanatarajan

Contributor

@bowling233 - Request you to address the review comment, if you are still pursuing this PR. Thank you. 🙇‍♀️

@bowling233

Author

Hi @anithapriyanatarajan,

So sorry for the late response! I won't have the bandwidth to properly validate these changes until next month.

I can confirm this approach has side effects—specifically, the HorizontalPodAutoscaler is hitting FailedGetResourceMetric errors because it's incorrectly picking up the main operator pods, which lack the expected CPU requests in the tekton-operator-lifecycle container.

This PR definitely needs more refinement to handle the selector immutability and the HPA configuration. Should I move this to a Draft for now, or would you prefer I close this and resubmit once I've validated a full fix?

@bowling233 bowling233 marked this pull request as draft March 10, 2026 14:34
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2026
@anithapriyanatarajan

Contributor

@bowling233 - would it be possible to progress this PR?

Both the `tekton-operator` and `tekton-operator-proxy-webhook`
Deployments label their Pods with `name: tekton-operator`. The
`tekton-operator-proxy-webhook` Service uses this same label as
its only selector, so it inadvertently load-balances traffic
across both Deployments. Because `tekton-operator` pods do not
serve on port 8443, ~50% of admission webhook requests fail:

  failed calling webhook "proxy.operator.tekton.dev":
  Post ".../tekton-operator-proxy-webhook.../defaulting":
  dial tcp <ClusterIP>:443: connect: connection refused

Because MutatingWebhookConfiguration has `failurePolicy: Fail`,
each such failure immediately rejects TaskRun Pod creation.

Rename the proxy-webhook Deployment's selector matchLabels and
pod template label from `name: tekton-operator` to
`name: tekton-operator-proxy-webhook`, and update the Service
selector to match. The `app: tekton-operator` label is left
unchanged. Applies to both Kubernetes and OpenShift manifests.

Adding a set-based (NotIn) expression to the Service selector
instead was not viable, as a Kubernetes Service selector is a flat
label map that supports only equality-based matching.
@tekton-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign anithapriyanatarajan after the PR has been reviewed.
You can assign the PR to them by writing /assign @anithapriyanatarajan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bowling233

Author

@jkhelil Thanks for the review. I checked the upgrade path and also clarified why this issue happens in our environment.

The root cause is that our deployment uses a single namespace for both the operator and its operands. We install the release manifest through Kustomize with namespace: tekton, and our TektonConfig uses spec.targetNamespace: tekton. In this layout, both the tekton-operator pod and the tekton-operator-proxy-webhook pod can exist in the same namespace.

Before this fix, the proxy webhook Service selected only name: tekton-operator, so in this single-namespace layout it could select both pods. The operator pod does not serve the webhook on 8443, which explains the intermittent admission failures. This is probably less visible in the default two-namespace layout because the operator pod and proxy webhook Service are separated by namespace.
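
A rough Go sketch of why the layout matters, given that a Service only selects pods in its own namespace. The types and the `endpoints` helper are hypothetical simplifications rather than operator code, and the two-namespace names `tekton-operator` and `tekton-pipelines` are used here as illustrative assumptions for the default layout.

```go
package main

import "fmt"

// Pod and Service are hypothetical, simplified stand-ins for the
// Kubernetes objects of the same name.
type Pod struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

type Service struct {
	Namespace string
	Selector  map[string]string
}

// endpoints returns the names of pods the Service would route to:
// pods in the Service's own namespace whose labels contain every
// key/value pair of the selector.
func endpoints(svc Service, pods []Pod) []string {
	var names []string
	for _, p := range pods {
		if p.Namespace != svc.Namespace {
			continue // Services never select pods from other namespaces
		}
		match := true
		for key, want := range svc.Selector {
			if got, ok := p.Labels[key]; !ok || got != want {
				match = false
				break
			}
		}
		if match {
			names = append(names, p.Name)
		}
	}
	return names
}

// layout places the operator pod in operatorNS and the proxy-webhook pod
// in operandNS, both carrying the pre-fix label name: tekton-operator.
func layout(operatorNS, operandNS string) []Pod {
	return []Pod{
		{Namespace: operatorNS, Name: "tekton-operator", Labels: map[string]string{"name": "tekton-operator"}},
		{Namespace: operandNS, Name: "tekton-operator-proxy-webhook", Labels: map[string]string{"name": "tekton-operator"}},
	}
}

func main() {
	oldSelector := map[string]string{"name": "tekton-operator"}

	// Two-namespace layout (namespace names are illustrative assumptions).
	svc := Service{Namespace: "tekton-pipelines", Selector: oldSelector}
	fmt.Println("two namespaces:", endpoints(svc, layout("tekton-operator", "tekton-pipelines")))

	// Single-namespace layout, as in our environment.
	svc = Service{Namespace: "tekton", Selector: oldSelector}
	fmt.Println("single namespace:", endpoints(svc, layout("tekton", "tekton")))
}
```

With the old selector, the two-namespace case resolves to the proxy-webhook pod alone, while the single-namespace case resolves to both pods.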

While testing upgrade from v0.78.1, I found one more issue: changing Deployment.spec.selector directly is not upgrade-safe because that field is immutable. The update fails with:

  Deployment "tekton-operator-proxy-webhook" is invalid:
  spec.selector: Invalid value: {"matchLabels":{"name":"tekton-operator-proxy-webhook"}}:
  field is immutable

So I added upgrade handling in the TektonInstallerSet reconciler: when an existing Deployment selector differs from the desired selector, the reconciler deletes the old Deployment and lets the next reconcile recreate it with the new selector.
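
The decision described above can be sketched roughly as follows. This is an illustration only, not the reconciler code in this PR: the `Deployment` type, the `Action` values, and `decide` are hypothetical names, and the real implementation works against the Kubernetes API rather than an in-memory value.

```go
package main

import "fmt"

// Deployment is a hypothetical, simplified stand-in that keeps only the
// selector's matchLabels, the immutable field relevant here.
type Deployment struct {
	Name        string
	MatchLabels map[string]string
}

// Action is what a reconcile pass decides to do with the Deployment.
type Action string

const (
	ActionCreate Action = "create"
	ActionUpdate Action = "update"
	ActionDelete Action = "delete-then-recreate-on-next-reconcile"
)

// sameLabels reports whether two label maps hold identical key/value pairs.
func sameLabels(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for key, want := range a {
		if got, ok := b[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// decide picks the action for one reconcile pass. existing is nil when the
// Deployment is not present in the cluster.
func decide(existing *Deployment, desired Deployment) Action {
	if existing == nil {
		return ActionCreate
	}
	if !sameLabels(existing.MatchLabels, desired.MatchLabels) {
		// spec.selector is immutable, so an in-place update would be rejected.
		return ActionDelete
	}
	return ActionUpdate
}

func main() {
	desired := Deployment{
		Name:        "tekton-operator-proxy-webhook",
		MatchLabels: map[string]string{"name": "tekton-operator-proxy-webhook"},
	}
	existing := &Deployment{
		Name:        "tekton-operator-proxy-webhook",
		MatchLabels: map[string]string{"name": "tekton-operator"},
	}

	// First pass: selectors differ, so the old Deployment is deleted.
	fmt.Println("pass 1:", decide(existing, desired))

	// Second pass: nothing exists any more, so the Deployment is created.
	existing = nil
	fmt.Println("pass 2:", decide(existing, desired))

	// Later passes: selectors agree, so a normal update applies.
	fmt.Println("pass 3:", decide(&desired, desired))
}
```

The trade-off of this approach is a short window between the delete and the recreate during which no proxy-webhook pod is running.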

Validation details

I verified the failure mode by simulating the old selector behavior in minikube:

  • proxy webhook Service selector: name=tekton-operator
  • endpoints included both the real proxy webhook pod and a same-namespace non-webhook pod
  • creating a pod that matched the proxy webhook objectSelector intermittently failed with:
      failed calling webhook "proxy.operator.tekton.dev":
      Post "https://tekton-operator-proxy-webhook.tekton-pipelines.svc:443/defaulting?timeout=10s":
      dial tcp <ClusterIP>:443: connect: connection refused

After applying the fix:

  • the Deployment selector, pod label, and Service selector all use name: tekton-operator-proxy-webhook
  • the Service endpoints only include the proxy webhook pod
  • matching pod creation succeeds and is mutated by the proxy webhook
  • upgrade from v0.78.1 to this branch completes successfully, with TektonConfig, TektonPipeline, and TektonTrigger all Ready

@bowling233 bowling233 marked this pull request as ready for review June 11, 2026 21:32
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2026
Delete and recreate managed Deployments when their selector changes, because Kubernetes treats spec.selector as immutable. This keeps upgrades working for proxy-webhook selector migrations.

Co-authored-by: OpenAI Codex <codex@openai.com>
@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 11, 2026

Labels

release-note: Denotes a PR that will be considered when it comes time to generate release notes.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tekton-operator-proxy-webhook Service selector matches operator pods, causing ~50% webhook admission failures

5 participants